After data preparation and exploration, I proceeded with modelling.
For this analysis I elected to implement two popular machine learning
techniques: decision trees and random forests. These methods were chosen
for their interpretability, their effectiveness on complex datasets, and
their capacity for both classification and regression tasks.
In each of these chosen techniques, two separate models were trained
to serve distinct predictive purposes. The first model targets the
prediction of profile views, while the second model aims at forecasting
profile likes. This dual-model approach was adopted in recognition of
the distinct factors that could potentially influence these two
different measures of user engagement. Each model is trained on a
different set of predictor variables, carefully chosen based on the
insights gathered during the data exploration phase.
The first step is to partition the dataset into training and testing
subsets. For this analysis, I adopted the widely used 70/30 split,
whereby 70% of the data forms the training set and the remaining 30% is
reserved for testing. This allocation strikes a balance: ample data to
train the model effectively, whilst retaining a substantial portion for
assessing the model’s performance on unseen data. The code below
demonstrates the method I employed to execute this split. I undertook
this process twice, resulting in two distinct sets - one for profile
views and another for profile likes - enabling a targeted examination of
each aspect of profile engagement.
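The split can be sketched as follows. This is a minimal illustration rather than the verbatim analysis code: the dataset name `profiles` and the helper `split_data` are placeholders, while `training_visits` and `training_kisses` match the object names visible in the model output later.

```r
set.seed(123)  # make the partition reproducible

# 70/30 split helper: samples 70% of the row indices for training
# and keeps the remaining 30% for testing
split_data <- function(df, train_frac = 0.7) {
  idx <- sample(seq_len(nrow(df)), size = floor(train_frac * nrow(df)))
  list(train = df[idx, ], test = df[-idx, ])
}

# Applied once per target ('profiles' stands in for the prepared data):
# visits_split    <- split_data(profiles)
# training_visits <- visits_split$train
# testing_visits  <- visits_split$test
# (and analogously for training_kisses / testing_kisses)
```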
Decision Tree Model
Decision trees are a predictive modelling approach, so named because
they produce a tree-like model of decisions. In this instance, two
decision tree models were constructed - one for profile visits and
another for profile likes.
For the profile visits model, three predictor variables were
considered: ‘isOnline’ (whether the user is currently online),
‘night_owl’ (whether the user is active during nighttime hours), and
‘age’.
The profile likes model, on the other hand, was slightly more
complex, considering a wider array of variables including ‘has_emoji’,
‘has_social’, ‘Profile_Views’, ‘counts_pictures’, ‘lang_count’,
‘flirtInterests_chat’, ‘flirtInterests_date’, ‘flirtInterests_friends’,
and ‘counts_details’. These variables were deemed to be potentially
relevant to the number of likes a profile receives.
Once each decision tree model was trained, the ‘summary’ function was
invoked to provide a comprehensive view of each model’s
characteristics, including variable importance, split points, and node
summaries - valuable insight into the models’ decision-making process.
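The fitting step can be sketched with `rpart` as below. Synthetic data is used here so the snippet runs standalone; in the analysis the same call is made with `data = training_visits` and the formula recorded in the Call line of the summary output. The object name `fit_visits` is illustrative.

```r
library(rpart)

# Synthetic stand-in for training_visits: a four-level outcome that
# depends mildly on isOnline and age
set.seed(1)
n <- 300
isOnline  <- rbinom(n, 1, 0.4)
night_owl <- rbinom(n, 1, 0.3)
age       <- sample(18:60, n, replace = TRUE)
score     <- 2 * isOnline + 0.05 * age + rnorm(n)
Profile_Views <- cut(score,
                     breaks = quantile(score, probs = seq(0, 1, 0.25)),
                     labels = c("Low", "Low Mid", "High Mid", "High"),
                     include.lowest = TRUE)
toy <- data.frame(Profile_Views, isOnline, night_owl, age)

# method = "class" fits a classification tree
fit_visits <- rpart(Profile_Views ~ isOnline + night_owl + age,
                    data = toy, method = "class")
summary(fit_visits)  # variable importance, splits, node details
```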
## Call:
## rpart(formula = Profile_Views ~ isOnline + night_owl + age, data = training_visits,
## method = "class")
## n= 2780
##
## CP nsplit rel error xerror xstd
## 1 0.09352518 0 1.0000000 1.0508393 0.01033357
## 2 0.01000000 1 0.9064748 0.9064748 0.01179770
##
## Variable importance
## isOnline night_owl age
## 77 14 8
##
## Node number 1: 2780 observations, complexity param=0.09352518
## predicted class=Low expected loss=0.75 P(node) =1
## class counts: 695 695 695 695
## probabilities: 0.250 0.250 0.250 0.250
## left son=2 (1658 obs) right son=3 (1122 obs)
## Primary splits:
## isOnline < 0.5 to the right, improve=29.722160, (0 missing)
## age < 21.5 to the right, improve=11.880220, (0 missing)
## night_owl < 0.5 to the left, improve= 3.611883, (0 missing)
## Surrogate splits:
## night_owl < 0.5 to the left, agree=0.671, adj=0.185, (0 split)
## age < 20.5 to the right, agree=0.640, adj=0.109, (0 split)
##
## Node number 2: 1658 observations
## predicted class=Low expected loss=0.6930036 P(node) =0.5964029
## class counts: 509 438 397 314
## probabilities: 0.307 0.264 0.239 0.189
##
## Node number 3: 1122 observations
## predicted class=High expected loss=0.6604278 P(node) =0.4035971
## class counts: 186 257 298 381
## probabilities: 0.166 0.229 0.266 0.340
## Call:
## rpart(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views +
## counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date +
## flirtInterests_friends + counts_details, data = training_kisses,
## method = "class")
## n= 2779
##
## CP nsplit rel error xerror xstd
## 1 0.3288201 0 1.0000000 1.0000000 0.01112283
## 2 0.1571567 1 0.6711799 0.6711799 0.01274569
## 3 0.1542553 2 0.5140232 0.5164410 0.01239983
## 4 0.0100000 3 0.3597679 0.3597679 0.01128688
##
## Variable importance
## Profile_Views counts_pictures counts_details
## 61 17 9
## has_emoji lang_count has_social
## 6 3 2
## flirtInterests_friends flirtInterests_chat
## 1 1
##
## Node number 1: 2779 observations, complexity param=0.3288201
## predicted class=Low expected loss=0.7441526 P(node) =1
## class counts: 711 686 689 693
## probabilities: 0.256 0.247 0.248 0.249
## left son=2 (1406 obs) right son=3 (1373 obs)
## Primary splits:
## Profile_Views splits as LLRR, improve=458.81820, (0 missing)
## counts_pictures < 3.5 to the left, improve=107.24340, (0 missing)
## counts_details < 0.02 to the left, improve= 43.05716, (0 missing)
## has_emoji < 0.5 to the left, improve= 22.35618, (0 missing)
## has_social < 0.5 to the left, improve= 16.14522, (0 missing)
## Surrogate splits:
## counts_pictures < 3.5 to the left, agree=0.677, adj=0.346, (0 split)
## counts_details < 0.75 to the left, agree=0.596, adj=0.182, (0 split)
## has_emoji < 0.5 to the left, agree=0.566, adj=0.122, (0 split)
## lang_count < 1.5 to the left, agree=0.535, adj=0.059, (0 split)
## has_social < 0.5 to the left, agree=0.534, adj=0.057, (0 split)
##
## Node number 2: 1406 observations, complexity param=0.1571567
## predicted class=Low expected loss=0.5007112 P(node) =0.5059374
## class counts: 702 551 149 4
## probabilities: 0.499 0.392 0.106 0.003
## left son=4 (692 obs) right son=5 (714 obs)
## Primary splits:
## Profile_Views splits as LR--, improve=249.316200, (0 missing)
## counts_pictures < 1.5 to the left, improve= 32.799080, (0 missing)
## counts_details < 0.02 to the left, improve= 14.828960, (0 missing)
## lang_count < 3.5 to the right, improve= 5.029280, (0 missing)
## has_emoji < 0.5 to the left, improve= 3.568267, (0 missing)
## Surrogate splits:
## counts_pictures < 1.5 to the left, agree=0.624, adj=0.237, (0 split)
## counts_details < 0.06 to the left, agree=0.571, adj=0.129, (0 split)
## flirtInterests_friends < 0.5 to the left, agree=0.546, adj=0.077, (0 split)
## has_emoji < 0.5 to the left, agree=0.525, adj=0.035, (0 split)
## flirtInterests_chat < 0.5 to the left, agree=0.520, adj=0.025, (0 split)
##
## Node number 3: 1373 observations, complexity param=0.1542553
## predicted class=High expected loss=0.4981792 P(node) =0.4940626
## class counts: 9 135 540 689
## probabilities: 0.007 0.098 0.393 0.502
## left son=6 (680 obs) right son=7 (693 obs)
## Primary splits:
## Profile_Views splits as --LR, improve=244.991200, (0 missing)
## counts_pictures < 9.5 to the left, improve= 18.498690, (0 missing)
## lang_count < 1.5 to the left, improve= 9.797405, (0 missing)
## counts_details < 0.785 to the left, improve= 6.305873, (0 missing)
## has_social < 0.5 to the left, improve= 5.557710, (0 missing)
## Surrogate splits:
## counts_pictures < 6.5 to the left, agree=0.613, adj=0.219, (0 split)
## has_emoji < 0.5 to the left, agree=0.554, adj=0.099, (0 split)
## counts_details < 0.75 to the left, agree=0.551, adj=0.093, (0 split)
## lang_count < 1.5 to the left, agree=0.541, adj=0.074, (0 split)
## flirtInterests_chat < 0.5 to the left, agree=0.529, adj=0.049, (0 split)
##
## Node number 4: 692 observations
## predicted class=Low expected loss=0.1604046 P(node) =0.2490104
## class counts: 581 105 6 0
## probabilities: 0.840 0.152 0.009 0.000
##
## Node number 5: 714 observations
## predicted class=Low Mid expected loss=0.3753501 P(node) =0.256927
## class counts: 121 446 143 4
## probabilities: 0.169 0.625 0.200 0.006
##
## Node number 6: 680 observations
## predicted class=High Mid expected loss=0.3691176 P(node) =0.2446923
## class counts: 7 134 429 110
## probabilities: 0.010 0.197 0.631 0.162
##
## Node number 7: 693 observations
## predicted class=High expected loss=0.1645022 P(node) =0.2493703
## class counts: 2 1 111 579
## probabilities: 0.003 0.001 0.160 0.835
The decision tree model for Profile Views was trained on
a dataset encompassing 2780 observations, using isOnline,
night_owl, and age as predictor variables. The
variable isOnline was deemed the most crucial, contributing
significantly to the total reduction of node impurity.
night_owl and age followed in importance. A single
split was made (nsplit = 1), which reduced the relative error. The
primary split candidates centred on isOnline, followed by age and
night_owl. Supplementary division rules, termed surrogate
splits, were also established.
Conversely, the decision tree model for Profile Likes
was developed using a wider array of predictor variables:
has_emoji, has_social,
Profile_Views, counts_pictures,
lang_count, flirtInterests_chat,
flirtInterests_date, flirtInterests_friends,
and counts_details. This model revealed
Profile_Views as the most significant variable, with
counts_pictures and counts_details next in
line. The decision tree model made three splits, each progressively
reducing the relative error. As with the first model, primary and
surrogate splits were established, with the chosen primary splits all
centred on Profile_Views.
Each node in these decision tree models divulges key predictive
details. To illustrate, Node 2 of the Profile Views tree
houses 1658 observations. The predicted category here is ‘Low’, with the
anticipated misclassification rate (expected loss) around 0.69. The node
also provides a distribution of the target variable categories in terms
of probabilities. This step is consistently applied across all nodes and
both decision tree models.
The following code chunk applies the model to the test set:
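A sketch of that evaluation step, on illustrative toy data: predict classes on the held-out set and cross-tabulate them against the truth. The full statistics shown in the output are those produced by `caret::confusionMatrix`; the object names here are stand-ins.

```r
library(rpart)

# Toy train/test data; the real analysis predicts on testing_visits
# and testing_kisses using the fitted trees
set.seed(2)
toy <- data.frame(
  y = factor(sample(c("Low", "High"), 300, replace = TRUE)),
  x = rnorm(300)
)
train <- toy[1:210, ]
test  <- toy[211:300, ]

fit  <- rpart(y ~ x, data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

table(Prediction = pred, Reference = test$y)  # raw confusion matrix
# caret::confusionMatrix(pred, test$y)  # accuracy, kappa, per-class stats
```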
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Low Mid High Mid High
## Low 214 171 155 136
## Low Mid 0 0 0 0
## High Mid 0 0 0 0
## High 84 127 143 162
##
## Overall Statistics
##
## Accuracy : 0.3154
## 95% CI : (0.2891, 0.3427)
## No Information Rate : 0.25
## P-Value [Acc > NIR] : 0.0000002124
##
## Kappa : 0.0872
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity 0.7181 0.00 0.00 0.5436
## Specificity 0.4832 1.00 1.00 0.6040
## Pos Pred Value 0.3166 NaN NaN 0.3140
## Neg Pred Value 0.8372 0.75 0.75 0.7988
## Prevalence 0.2500 0.25 0.25 0.2500
## Detection Rate 0.1795 0.00 0.00 0.1359
## Detection Prevalence 0.5671 0.00 0.00 0.4329
## Balanced Accuracy 0.6007 0.50 0.50 0.5738
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Low Mid High Mid High
## Low 253 48 0 0
## Low Mid 48 174 57 0
## High Mid 4 71 186 52
## High 0 1 53 246
##
## Overall Statistics
##
## Accuracy : 0.72
## 95% CI : (0.6936, 0.7454)
## No Information Rate : 0.2557
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6267
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity 0.8295 0.5918 0.6284 0.8255
## Specificity 0.9459 0.8832 0.8584 0.9397
## Pos Pred Value 0.8405 0.6237 0.5942 0.8200
## Neg Pred Value 0.9417 0.8687 0.8750 0.9418
## Prevalence 0.2557 0.2464 0.2481 0.2498
## Detection Rate 0.2121 0.1459 0.1559 0.2062
## Detection Prevalence 0.2523 0.2339 0.2624 0.2515
## Balanced Accuracy 0.8877 0.7375 0.7434 0.8826
The output presents the confusion matrices and statistics for the two
decision tree models’ performance on the testing data.
The first model predicts four categories: Low, Low Mid, High Mid, and
High. It only ever predicted Low or High, never Low Mid or High Mid. As
a result, the model’s accuracy is low at 31.54%, with a 95% confidence
interval between 28.91% and 34.27%. The kappa statistic is 0.0872,
indicating poor agreement between the model’s predictions and the
actual categories.
For each class, we can observe the following:
Class Low: The model has a sensitivity, or true positive rate, of
71.81%, meaning it correctly identified 71.81% of the Low class
instances. However, its positive predictive value (the proportion of
true positives among the predicted positives) is just 31.66%,
indicating a high false positive rate. The model has a balanced
accuracy of 60.07% for
this class, which accounts for both sensitivity and specificity and is
an overall measure of its performance.
The model didn’t predict Low Mid and High Mid at all, which
explains the zero values for Sensitivity and Detection Rate and the
NaN values for Pos Pred Value in these classes.
Class High: The model has a sensitivity of 54.36% and a positive
predictive value of 31.40%, indicating that it struggles to accurately
identify and predict High class instances. The balanced accuracy is
57.38% for this class.
In the second model, the overall accuracy improves substantially to
72.00%, with a 95% confidence interval between 69.36% and 74.54%. The
kappa statistic is 0.6267, indicating substantial agreement between the
model’s predictions and the actual categories.
For each class, we can observe the following:
Class Low: The model has a high sensitivity of 82.95% and a
positive predictive value of 84.05%. The balanced accuracy is 88.77% for
this class, suggesting a good performance in identifying and predicting
Low class instances.
Class Low Mid: The model has a moderate sensitivity of 59.18% and
a positive predictive value of 62.37%. The balanced accuracy for this
class is 73.75%.
Class High Mid: The model’s performance decreases for this class,
with a sensitivity of 62.84% and a positive predictive value of 59.42%.
The balanced accuracy for this class is 74.34%.
Class High: The model performs well with this class, with a
sensitivity of 82.55% and a positive predictive value of 82.00%. The
balanced accuracy is 88.26% for this class.
The second model outperforms the first in predicting the test
data, with a much higher accuracy and substantial agreement
between predictions and actual categories. However, there is room for
improvement, particularly in predicting the Low Mid and High Mid
classes.
Figure 9 below visualises the decision trees.
Random Forest Model
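The random forest fitting step can be sketched as follows, again on synthetic data so the snippet is self-contained; the actual models use the formulas and arguments recorded in the Calls below, with `data = training_visits` / `training_kisses`. The object name `rf_visits` is illustrative.

```r
library(randomForest)

# Synthetic stand-in for training_visits
set.seed(3)
toy <- data.frame(
  Profile_Views = factor(sample(c("Low", "Low Mid", "High Mid", "High"),
                                400, replace = TRUE)),
  isOnline  = rbinom(400, 1, 0.4),
  night_owl = rbinom(400, 1, 0.3),
  age       = sample(18:60, 400, replace = TRUE)
)

# ntree = 500 grows 500 trees; importance = TRUE records variable
# importance for later inspection and plotting
rf_visits <- randomForest(Profile_Views ~ isOnline + night_owl + age,
                          data = toy, importance = TRUE, ntree = 500)
print(rf_visits)  # OOB error estimate and per-class confusion matrix
```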
##
## Call:
## randomForest(formula = Profile_Views ~ isOnline + night_owl + age + genderLooking, data = training_visits, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 68.35%
## Confusion matrix:
## Low Low Mid High Mid High class.error
## Low 416 35 57 187 0.4014388
## Low Mid 342 40 54 259 0.9424460
## High Mid 298 43 54 300 0.9223022
## High 227 40 58 370 0.4676259
##
## Call:
## randomForest(formula = Profile_Likes ~ has_emoji + has_social + Profile_Views + counts_pictures + lang_count + flirtInterests_chat + flirtInterests_date + flirtInterests_friends + counts_details, data = training_kisses, importance = TRUE, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 27.78%
## Confusion matrix:
## Low Low Mid High Mid High class.error
## Low 583 119 7 2 0.1800281
## Low Mid 116 423 143 4 0.3833819
## High Mid 7 143 426 113 0.3817126
## High 0 10 108 575 0.1702742
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Low Mid High Mid High
## Low 163 135 117 92
## Low Mid 18 9 10 19
## High Mid 32 28 26 27
## High 84 126 145 160
##
## Overall Statistics
##
## Accuracy : 0.3006
## 95% CI : (0.2746, 0.3275)
## No Information Rate : 0.2502
## P-Value [Acc > NIR] : 0.00004694
##
## Kappa : 0.0676
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Statistics by Class:
##
## Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity 0.5488 0.030201 0.08725 0.5369
## Specificity 0.6152 0.947368 0.90258 0.6025
## Pos Pred Value 0.3215 0.160714 0.23009 0.3107
## Neg Pred Value 0.8041 0.745374 0.74768 0.7959
## Prevalence 0.2494 0.250210 0.25021 0.2502
## Detection Rate 0.1369 0.007557 0.02183 0.1343
## Detection Prevalence 0.4257 0.047019 0.09488 0.4324
## Balanced Accuracy 0.5820 0.488785 0.49491 0.5697
## Confusion Matrix and Statistics
##
## Reference
## Prediction Low Low Mid High Mid High
## Low 255 47 1 0
## Low Mid 44 167 57 3
## High Mid 6 78 183 53
## High 0 2 55 242
##
## Overall Statistics
##
## Accuracy : 0.71
## 95% CI : (0.6833, 0.7356)
## No Information Rate : 0.2557
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.6133
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Low Class: Low Mid Class: High Mid Class: High
## Sensitivity 0.8361 0.5680 0.6182 0.8121
## Specificity 0.9459 0.8843 0.8473 0.9363
## Pos Pred Value 0.8416 0.6162 0.5719 0.8094
## Neg Pred Value 0.9438 0.8623 0.8706 0.9374
## Prevalence 0.2557 0.2464 0.2481 0.2498
## Detection Rate 0.2137 0.1400 0.1534 0.2028
## Detection Prevalence 0.2540 0.2272 0.2682 0.2506
## Balanced Accuracy 0.8910 0.7262 0.7328 0.8742
The outcomes obtained from the two separate random forest models may
be interpreted as follows:
Profile Views Random Forest Model
This model aimed to predict ‘Profile_Views’ utilizing the features:
‘isOnline’, ‘night_owl’, ‘age’, and ‘genderLooking’. The training
process involved 500 decision trees, with each split in the tree
considering 2 variables.
An Out-of-Bag (OOB) error estimate, a commonly used internal measure
of the accuracy of random forest models, was computed to be 68.35%.
This indicates that the model misclassified about 68.35% of the
out-of-bag observations.
An examination of the confusion matrix reveals varying rates of
accuracy across the different classes. For example, the model exhibits
the most accurate predictions for the ‘Low’ category, as evidenced by a
class error rate of approximately 40.14%. Conversely, the ‘Low Mid’ and
‘High Mid’ categories showed substantial misclassification, with class
error rates above 90% (94.24% and 92.23% respectively).
Profile Likes Random Forest Model
The second model sought to predict ‘Profile_Likes’ based on a range
of features including ‘has_emoji’, ‘has_social’, ‘Profile_Views’,
‘counts_pictures’, ‘lang_count’, ‘flirtInterests_chat’,
‘flirtInterests_date’, ‘flirtInterests_friends’, and ‘counts_details’.
Similar to the first model, this one was also trained using 500 trees.
However, at each split, this model considered 3 variables.
The OOB error rate for the second model is substantially lower at
27.78%, suggesting a better fit to the data as compared to the first
model.
Upon analyzing the confusion matrix, it can be observed that the
model demonstrated reasonable accuracy for the ‘Low’ and ‘High’ classes,
with class error rates of 18.00% and 17.03% respectively. Nonetheless,
the model encountered challenges with the ‘Low Mid’ and ‘High Mid’
categories, where the class error rates were 38.34% and 38.17%
respectively.
Results from Testing Data
The profile views model demonstrates an overall accuracy rate of
30.06%, indicating that it classifies the data correctly only about 30%
of the time. The sensitivity, or true positive rate, varies
considerably across classes, with the highest rate (54.88%) observed
for the ‘Low’ category and the lowest rate (3.02%) for the ‘Low Mid’
category. The specificity, or true negative rate, also varies, ranging
from 94.74% for the ‘Low Mid’ category down to 60.25% for the ‘High’
category. These variations suggest differential model performance
across classes.
The profile likes model achieves a more satisfactory accuracy rate
of 71.00%. In this case, both sensitivity and specificity are more
evenly distributed across the classes, implying more consistent model
performance.
In conclusion, the analysis suggests that the second model is more
accurate and robust in making predictions compared to the first. It is
also important to note that both models show varying performance levels
when applied to different classes, which could be due to distinct
characteristics within each class that the models capture with varying
degrees of success.
Figure 10 below shows the importance of each variable in terms of
predictive power for the profile likes random forest model.
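Such an importance plot can be produced with the `randomForest` package as sketched here; the fitted object and data are illustrative stand-ins for the profile likes forest, which was trained with `importance = TRUE`.

```r
library(randomForest)

# Illustrative forest on toy data (the real call uses the fitted
# profile likes model). With importance = TRUE, both the permutation-
# based MeanDecreaseAccuracy and the Gini-based MeanDecreaseGini
# measures are available.
set.seed(4)
toy <- data.frame(y = factor(rbinom(120, 1, 0.5)),
                  a = rnorm(120), b = rnorm(120))
rf <- randomForest(y ~ a + b, data = toy, importance = TRUE, ntree = 100)

importance(rf)   # importance scores per predictor
varImpPlot(rf)   # the kind of plot shown in Figure 10
```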
Discussion & Conclusion
The decision tree model for profile likes performs reasonably well;
however, it does not fully capture the intricate relationships within
the data. Every split in the fitted tree uses a single predictor,
‘Profile_Views’. While profile views may be a critical factor,
effectively ignoring the other variables potentially diminishes the
model’s performance.
In comparing the decision tree and random forest models, several key
points emerge that offer insights into their relative strengths and
weaknesses in this particular application.
Model Complexity and Understanding
The decision tree model has the advantage of being relatively simple
to understand and interpret. Each decision within the tree corresponds
to a question about one of the variables, making it a model that’s easy
to visualize and explain. However, this simplicity can also be a
limitation as it may not capture complex interactions among variables.
This may explain its less-than-satisfactory performance on certain
metrics, like sensitivity and specificity across various classes, and
overall accuracy.
On the other hand, the random forest model, which operates by
creating a multitude of decision trees and aggregating their results, is
capable of capturing more complex patterns and interactions in the data.
However, the trade-off is that it’s more challenging to interpret, as it
essentially involves a multitude of decision processes rather than just
one.
Robustness
Random forest models are known to be less prone to overfitting
compared to decision tree models. This is because they average the
results of many different trees, each of which is trained on a slightly
different subset of the data. In this analysis, however, this
robustness advantage did not translate into better test-set
performance: the random forests performed on par with the single
decision trees.
Computational Complexity
From a computational perspective, the decision tree model is less
resource-intensive, making it a more suitable choice for datasets with a
large number of variables or instances. Random forests, however, can
require significant computational resources, especially as the number of
trees increases.
In conclusion, the decision tree models proved both easily
interpretable and computationally efficient, and in this specific
scenario they matched - indeed slightly exceeded - the test-set
accuracy of the random forest models (72.00% vs 71.00% for profile
likes). While random forests are generally better at capturing complex
interactions and more robust to overfitting, those advantages did not
materialise here. It’s a reminder that there’s always a trade-off
between interpretability and predictive performance, and that the best
model depends on the specific context and the requirements of the
analysis.